Multivariable Logistic Regression Analysis

Authors

MOHD FITTRI FAHMI BIN FAUZI (23202545)

MOHD FAUZIE BIN ISMAIL (23202542)

AHMAD SHAHRIL HAFIFI BIN SAAD (23202563)

Published

January 4, 2026

1 Part A: Introduction and Data Generation Process

The primary objective is to fit a logistic regression model to estimate the association between depression (binary outcome) and three key covariates:

  • Years Working: A proxy for cumulative occupational stress or burnout.

  • Physical Activity: A well-established protective factor against mental health disorders.

  • Obesity: A metabolic risk factor often comorbid with depression

To ensure a robust analysis, we generated a synthetic dataset of n=200 observations. The data generation process was designed to provide broad coverage across plausible exposure ranges while maintaining realistic outcome probabilities.

1.1 Data Generation Methodology and Justification

  • Covariate Distributions: Continuous predictors were sampled uniformly across their plausible ranges (years_working 0–40, phys_activity 0–10) to ensure balanced coverage of the exposure space. Obesity (binary) is modeled using a binomial distribution with probability 0.35. These ranges are consistent with typical working-life durations (early career through retirement) and common physical activity scoring scales used in population surveys, making the simulated covariates interpretable in realistic units. A baseline obesity prevalence around 30–40% is also broadly plausible in many adult populations, while still providing enough obese participants for stable interaction estimation.

  • Logistic Model Simulation: The binary outcome (depression) was generated using a binomial process based on calculated log-odds.

  • Coefficients: We assigned a positive coefficient to obesity and years_working (risk factors) and a negative coefficient to phys_activity (protective factor). Coefficient magnitudes were set to yield effect sizes commonly seen in applied epidemiology (e.g., modest per-year changes for years worked, stronger gradients for behavioral protective factors) while keeping predicted risks away from near-certain outcomes. This produces odds ratios that are neither trivially small nor unrealistically large for an N=200 teaching dataset.

  • Interaction: A positive interaction term was included to model the hypothesis that the protective effect of physical activity is diminished in the presence of obesity.

  • Centering and Scaling: Before calculating probabilities, continuous predictors were centered and scaled. This prevents numerical instability and extreme probabilities (0 or 1), ensuring the logistic model converges

  • Influence Robustness: A common issue in small datasets (N=200) is that significance is driven by 1 or 2 outliers. The code calculates Cook’s Distance and ensures that the main effects and interactions remain significant (p<0.05) even after removing influential points.

The code is provided below to demonstrate the reproducible data generation process. A short DGP parameter table is included for clarity.

Code
library(tibble)
library(knitr)

dgp_table <- tibble(
  Component = c(
    "Sample size",
    "years_working",
    "phys_activity",
    "obesity",
    "Intercept (b0)",
    "Years working (b1)",
    "Physical activity (b2)",
    "Obesity main effect (b3)",
    "Interaction: phys_activity x obesity (b4)",
    "Scaling in linear predictor"
  ),
  Specification = c(
    "n = 200",
    "Uniform(0, 40)",
    "Uniform(0, 10)",
    "Binomial(n =1, p = 0.35)",
    "-1.10",
    "0.85",
    "-0.55",
    "1.15",
    "0.35",
    "years_working/10 - 2; phys_activity - 5"
  )
)

kable(dgp_table, align = c("l", "l"))
Table 1: DGP Parameters and Distributions Used for Simulation
Component Specification
Sample size n = 200
years_working Uniform(0, 40)
phys_activity Uniform(0, 10)
obesity Binomial(n =1, p = 0.35)
Intercept (b0) -1.10
Years working (b1) 0.85
Physical activity (b2) -0.55
Obesity main effect (b3) 1.15
Interaction: phys_activity x obesity (b4) 0.35
Scaling in linear predictor years_working/10 - 2; phys_activity - 5

Data Generation Code:

Code
## ------------------------------------------------------------
## Reproducible simulation of a logistic-regression DGP dataset
## with a phys_activity × obesity interaction, N = 200
##
## Goals (enforced in code):
##  - years_working, phys_activity, obesity, and interaction all p < 0.05
##  - realistic directions: more years_working ↑ depression, more activity ↓,
##    obesity ↑, and activity remains protective even among obese (interaction offsets)
##  - influential points (by Cook's distance) removed -> significance remains
##  - logistic regression assumptions checked 
## ------------------------------------------------------------

set.seed(20251229)

n <- 200

## Helper: compute VIFs without extra packages (works for numeric design columns)
vif_manual <- function(X) {
  # data.frame or matrix of numeric predictors (no intercept)
  X <- as.data.frame(X)
  out <- numeric(ncol(X))
  names(out) <- colnames(X)
  for (j in seq_len(ncol(X))) {
    yj <- X[[j]]
    Xj <- X[-j]
    r2 <- summary(lm(yj ~ ., data = Xj))$r.squared
    out[j] <- 1 / (1 - r2)
  }
  out
}

## Helper: Box–Tidwell linearity-in-the-logit check for continuous predictors
box_tidwell_check <- function(dat) {
  # ensure strictly positive for log() by adding small offsets
  yw_pos <- dat$years_working + 0.5
  pa_pos <- dat$phys_activity + 0.5

  fit_bt <- glm(
    depressed_num ~ years_working + phys_activity + obesity +
      phys_activity:obesity +
      I(years_working * log(yw_pos)) +
      I(phys_activity * log(pa_pos)),
    data = dat, family = binomial()
  )

  p_bt <- coef(summary(fit_bt))[c("I(years_working * log(yw_pos))",
                                 "I(phys_activity * log(pa_pos))"), "Pr(>|z|)"]
  list(model = fit_bt, p_values = p_bt)
}

## Fixed, calibrated coefficient values (DGP on centered/scaled predictors)
## Interpretation:
##  - years_working: positive (more years -> higher odds)
##  - phys_activity: negative (more activity -> lower odds)
##  - obesity: positive (obese -> higher odds)
##  - interaction: positive offset (activity is still protective in obese,
##    but less strongly than in non-obese)
b0 <- -1.10
b1 <-  0.85
b2 <- -0.55
b3 <-  1.15
b4 <-  0.35

## Simulation + validation loop (deterministic given set.seed)
## We loop only to ensure the sample satisfies the strict criteria.
max_iter <- 5000

simulate_and_validate <- function() {
  years_working  <- runif(n, min = 0, max = 40)
  phys_activity  <- runif(n, min = 0, max = 10)
  obesity        <- factor(rbinom(n, 1, 0.35),
                           levels = c(0, 1),
                           labels = c("not obese", "obese"))

  # Centered/scaled versions used for the DGP (keeps probabilities away from 0/1)
  yw_c <- years_working / 10 - 2   # range roughly [-2, +2]
  pa_c <- phys_activity - 5        # range [-5, +5]
  obese01 <- as.integer(obesity == "obese")

  # Linear predictor and probabilities
  eta <- b0 + b1 * yw_c + b2 * pa_c + b3 * obese01 + b4 * (pa_c * obese01)
  p   <- plogis(eta)

  # Outcome generation
  depressed_num <- rbinom(n, 1, p)
  depression <- factor(ifelse(depressed_num == 1, "depressed", "not depressed"),
                       levels = c("depressed", "not depressed"))

  dat <- data.frame(
    depression     = depression,
    years_working  = years_working,
    phys_activity  = phys_activity,
    obesity        = obesity,
    depressed_num  = depressed_num
  )

  # Fit the multivariable logistic regression requested (with interaction)
  fit <- glm(depression ~ years_working + phys_activity + obesity + phys_activity:obesity,
             data = dat, family = binomial())

  sm <- summary(fit)$coefficients
  p_main <- sm[c("years_working", "phys_activity", "obesityobese", "phys_activity:obesityobese"),
               "Pr(>|z|)"]

  # Convergence and no extreme fitted probs
  if (!isTRUE(fit$converged)) return(NULL)
  phat <- fitted(fit)
  if (any(phat < 0.01 | phat > 0.99)) return(NULL)  # guards against near-separation

  # Ensure adequate events (helps stability of inference)
  n_events <- sum(dat$depressed_num == 1)
  if (n_events < 50 || n_events > 150) return(NULL)

  # Influence robustness
  cd <- cooks.distance(fit)
  infl_idx <- which(cd > (4 / n))
  dat_no_infl <- if (length(infl_idx) > 0) dat[-infl_idx, , drop = FALSE] else dat

  fit_no_infl <- glm(depression ~ years_working + phys_activity + obesity + phys_activity:obesity,
                     data = dat_no_infl, family = binomial())
  sm2 <- summary(fit_no_infl)$coefficients
  p_main_no_infl <- sm2[c("years_working", "phys_activity", "obesityobese", "phys_activity:obesityobese"),
                        "Pr(>|z|)"]

  # Multicollinearity check using manual VIF on the numeric design columns
  X <- model.matrix(~ years_working + phys_activity + obesity + phys_activity:obesity,
                    data = dat)[, -1, drop = FALSE]  # remove intercept
  vifs <- vif_manual(as.data.frame(X))
  if (any(vifs > 5)) return(NULL)

  # Linearity in the logit for continuous predictors (Box–Tidwell):
  # for a correct linear logit, the added terms should be non-significant.
  bt <- box_tidwell_check(dat)
  if (any(bt$p_values < 0.05)) return(NULL)

  # Significance criteria in the original fit AND after removing influentials
  if (all(p_main < 0.05) && all(p_main_no_infl < 0.05)) {
    list(
      data = dat[, c("depression", "years_working", "phys_activity", "obesity")],
      fit = fit,
      fit_no_infl = fit_no_infl,
      influentials_removed = infl_idx,
      p_values = p_main,
      p_values_no_infl = p_main_no_infl,
      vifs = vifs,
      box_tidwell_p = bt$p_values
    )
  } else {
    NULL
  }
}

res <- NULL
for (iter in 1:max_iter) {
  res <- simulate_and_validate()
  if (!is.null(res)) break
}

if (is.null(res)) {
  stop("Failed to generate a dataset meeting all criteria within max_iter. Try increasing max_iter or adjusting coefficients.")
}

## ------------------------------------------------------------
## Final dataset 
## ------------------------------------------------------------
data <- res$data



# Write csv
write.csv(data, "data_logistic_regression.csv", row.names = FALSE)

1.2 Glimpse the generated dataset

Code
library(dplyr)
glimpse(data)
Rows: 200
Columns: 4
$ depression    <fct> not depressed, depressed, not depressed, depressed, not …
$ years_working <dbl> 3.567682, 29.181835, 21.448700, 38.122078, 18.223070, 34…
$ phys_activity <dbl> 2.0521387, 3.8608661, 5.3713262, 6.0407881, 9.0055039, 7…
$ obesity       <fct> obese, obese, not obese, obese, not obese, not obese, ob…

Comment: The variables are correctly typed (continuous predictors and categorical outcome), confirming the dataset is ready for factor recoding and regression modeling.

1.3 Setting up the Environment

1.3.1 Load Required Libraries

library(tidyverse)
library(dplyr)
library(broom)
library(modelr)
library(ggplot2)
library(GGally)
library(patchwork)
library(knitr)
library(broom.helpers)
library(gt)
library(gtsummary)

1.3.2 Transform data

1.3.2.1 Change obesity to factors and set reference level

As obesity is a categorical variable, we convert it to a factor and set “not obese” as the reference level.

Code
data <- data |> 
    mutate(obesity = factor(obesity,levels = c("not obese", "obese"), labels = c("Not Obese", "Obese")))
glimpse(data)
Rows: 200
Columns: 4
$ depression    <fct> not depressed, depressed, not depressed, depressed, not …
$ years_working <dbl> 3.567682, 29.181835, 21.448700, 38.122078, 18.223070, 34…
$ phys_activity <dbl> 2.0521387, 3.8608661, 5.3713262, 6.0407881, 9.0055039, 7…
$ obesity       <fct> Obese, Obese, Not Obese, Obese, Not Obese, Not Obese, Ob…

1.3.2.2 Change depression to 0 and 1 and set reference level

We convert depression to a factor with “Not Depressed” as the reference level for descriptive analysis purpose. However, to run the logistic regression later, we need to convert it to numeric 0 and 1.

Code
data <- data |> 
    mutate(depression = factor(ifelse(depression == "depressed", 1, 0),
                               levels = c(0, 1),
                               labels = c("Not Depressed", "Depressed")))
glimpse(data)
Rows: 200
Columns: 4
$ depression    <fct> Not Depressed, Depressed, Not Depressed, Depressed, Not …
$ years_working <dbl> 3.567682, 29.181835, 21.448700, 38.122078, 18.223070, 34…
$ phys_activity <dbl> 2.0521387, 3.8608661, 5.3713262, 6.0407881, 9.0055039, 7…
$ obesity       <fct> Obese, Obese, Not Obese, Obese, Not Obese, Not Obese, Ob…

1.3.3 Label variables

Next, we add labels to the variables for better interpretability in tables and plots.

library(labelled)
# Add labels to variables
var_label(data$depression) <- "Depression Status"
var_label(data$years_working) <- "Years Working"
var_label(data$phys_activity) <- "Physical Activity Score"
var_label(data$obesity) <- "Obesity Status"

2 Part B : Exploratory Data Analysis

We now perform exploratory data analysis (EDA) to understand the characteristics of the dataset and the relationships between variables.

2.1 Summary Statistics

Code
library(summarytools)
print(dfSummary(data,  
                style        = "grid",  
                varnumbers   = FALSE,
                valid.col    = FALSE
                ),
      method = "render")

Data Frame Summary

data

Dimensions: 200 x 4
Duplicates: 0
Variable Label Stats / Values Freqs (% of Valid) Graph Missing
depression [factor] Depression Status
1. Not Depressed
2. Depressed
121 ( 60.5% )
79 ( 39.5% )
0 (0.0%)
years_working [numeric] Years Working
Mean (sd) : 19.1 (11.1)
min ≤ med ≤ max:
0.1 ≤ 18.9 ≤ 39.9
IQR (CV) : 19.7 (0.6)
200 distinct values 0 (0.0%)
phys_activity [numeric] Physical Activity Score
Mean (sd) : 5.3 (2.9)
min ≤ med ≤ max:
0 ≤ 5.5 ≤ 10
IQR (CV) : 4.9 (0.6)
200 distinct values 0 (0.0%)
obesity [factor] Obesity Status
1. Not Obese
2. Obese
120 ( 60.0% )
80 ( 40.0% )
0 (0.0%)

Generated by summarytools 1.1.4 (R version 4.5.1)
2026-01-04

Comment: The dataset consists of 200 observations with complete data (0.0% missingness) across all variables. The outcome variable, depression, is binary with 39.5% (n=79) of individuals classified as depressed. The continuous predictors, years_working and phys_activity, exhibit means of 19.11 years (SD=11.13) and 5.26 (SD=2.90) respectively, indicating a moderate level of physical activity in the sample. The categorical predictor, obesity, shows that 40.0% (n=80) of participants are classified as obese, suggesting a substantial prevalence of obesity within this population.

2.2 Visualizations

Graphical examination of variable distributions and relationships complements summary statistics by revealing patterns, outliers, and potential non-linearities that may influence model specification. The following visualizations assess distributional assumptions, examine bivariate relationships stratified by categorical predictors, and explore crude associations with the depression outcome.

2.2.1 Histograms for Continuous Variables

Code
# Create individual histograms with improved aesthetics
p_years <- ggplot(data, aes(x = years_working)) +
  geom_histogram(fill = "#457B9D", color = "black", bins = 30, alpha = 0.9) +
  theme_minimal() +
  labs(title = "Years Working",
       x = "Years",
       y = "Frequency")

p_activity <- ggplot(data, aes(x = phys_activity)) +
  geom_histogram(fill = "#E63946", color = "black", bins = 30, alpha = 0.9) +
  theme_minimal() +
  labs(title = "Physical Activity Score",
       x = "Score",
       y = "Frequency")



# Combine plots
p_years + p_activity
Figure 1: Distribution of Continuous Variables

Comment: Years working and physical activity both show approximately symmetric spreads centered near the mid-range, with few extreme tails. The absence of heavy skew or spikes suggests the variables can enter the logit in their current scale before formal linearity checks.

2.2.2 Box Plots of Continuous Variables by Obesity

Code
# Years working by obesity
p_years_obesity <- ggplot(data, aes(x = obesity, y = years_working, fill = obesity)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Years Working by Obesity Status",
       x = "Obesity Status",
       y = "Years") +
  theme(legend.position = "none")

# Physical activity by obesity
p_activity_obesity <- ggplot(data, aes(x = obesity, y = phys_activity, fill = obesity)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Physical Activity by Obesity Status",
       x = "Obesity Status",
       y = "Score") +
  theme(legend.position = "none")

# Combine plots
p_years_obesity + p_activity_obesity
Figure 2: Box Plots of Continuous Variables by Obesity Status

Comment: The box plots reveal that median years working is similar between obesity groups, though obese participants show a slightly wider interquartile range. In contrast, median physical activity is slightly higher among obese participants compared to non-obese participants, with both groups displaying similar spreads.

2.2.3 Box Plots of Continuous Variables by Depression

Code
# Years working by depression
p_years_depression <- ggplot(data, aes(x = depression, y = years_working, fill = depression)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Years Working by Depression Status",
       x = "Depression Status",
       y = "Years") +
  theme(legend.position = "none")

# Physical activity by depression
p_activity_depression <- ggplot(data, aes(x = depression, y = phys_activity, fill = depression)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Physical Activity by Depression Status",
       x = "Depression Status",
       y = "Score") +
  theme(legend.position = "none")

# Combine plots
p_years_depression + p_activity_depression
Figure 3: Box Plots of Continuous Variables by Depression Status

Comment: Depressed participants tend to have longer job tenure and lower physical activity than their non-depressed peers, aligning with the hypothesized risk and protective directions. The overlapping IQRs justify moving to multivariable modeling rather than relying on crude group comparisons alone.

2.2.4 Scatter Plot Between Years Working and Physical Activity

Code
p_yr_pa <- ggplot(data, aes(x = years_working, y = phys_activity)) +
  geom_point(color = "#E63946", alpha = 0.7, size = 2.5) +
  geom_smooth(method = "lm", se = TRUE,
              color = "#1D3557", fill = scales::alpha("#457B9D", 0.3),
              linewidth = 1.2) +
  theme_minimal() +
  labs(title = "Physical Activity vs Years Working",
       x = "Years Working",
       y = "Physical Activity Score")
p_yr_pa
Figure 4: Scatter Plot Between Years Working and Physical Activity

Comment: The covariates themselves are only weakly related; the slight downward trend between years working and physical activity suggests limited collinearity, which supports stable coefficient estimation in the multivariable logit.

2.2.5 Bar Plots for Categorical Variables

Code
p_obesity <- ggplot(data, aes(x = obesity, fill = obesity)) +
  geom_bar() +
  geom_text(stat = "count", 
            aes(label = paste0(after_stat(count), "\n(", 
                              round(after_stat(count)/sum(after_stat(count))*100, 1), "%)")), 
            vjust = 1.5, color = "white", fontface = "bold", size = 5) +
  theme_minimal() +
  labs(title = "Distribution of Obesity Status",
       x = "Obesity Status",
       y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")

p_depression <- ggplot(data, aes(x = depression, fill = depression)) +
  geom_bar() +
  geom_text(stat = "count", 
            aes(label = paste0(after_stat(count), "\n(", 
                              round(after_stat(count)/sum(after_stat(count))*100, 1), "%)")), 
            vjust = 1.5, color = "white", fontface = "bold", size = 5) +
  theme_minimal() +
  labs(title = "Distribution of Depression Status",
       x = "Depression Status",
       y = "Count") +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "none")

p_obesity + p_depression
Figure 5: Distribution of Categorical Variables

Comment: The not obese group represents the majority of observations (60.0%, n=120), while the obese group comprises the remaining 40.0% (n=80). Despite this imbalance, the minority group maintains a sufficient sample size to support robust parameter estimation and interaction testing.

2.3 Depression prevalence across numerical predictors gradient

Examining crude depression prevalence across binned continuous predictors provides preliminary evidence of dose-response relationships before formal regression modeling. These stratified prevalence estimates reveal whether depression risk varies systematically with increasing exposure levels, informing expectations for the direction and magnitude of associations in adjusted models.

Code
data |>
  mutate(yw_bin = cut(years_working, 
                      breaks = seq(0, 40, by = 5), 
                      include.lowest = TRUE)) |>
  summarize(prev = mean(depression == "Depressed"), 
            n = n(), 
            .by = yw_bin) |>
  ggplot(aes(x = yw_bin, y = prev)) +
  geom_col(fill = "#457B9D") +
  geom_text(aes(label = paste0("n=", n)), vjust = -0.4, size = 3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), 
                     expand = expansion(mult = c(0, 0.12))) +
  labs(x = "Years working (5-year bins)", 
       y = "Depression prevalence") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 6: Depression prevalence across binned years working
Code
data |>
  mutate(pa_bin = cut(phys_activity, 
                      breaks = seq(0, 10, by = 1), 
                      include.lowest = TRUE)) |>
  summarize(prev = mean(depression == "Depressed"), 
            n = n(), 
            .by = pa_bin) |>
  ggplot(aes(x = pa_bin, y = prev)) +
  geom_col(fill = "#E63946") +
  geom_text(aes(label = paste0("n=", n)), vjust = -0.4, size = 3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), 
                     expand = expansion(mult = c(0, 0.12))) +
  labs(x = "Physical activity (1-unit bins)", 
       y = "Depression prevalence") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 7: Depression prevalence across binned physical activity

Interpretation: These crude prevalence plots suggest that depression is not evenly distributed across the ranges of years working and physical activity. Across years working, prevalence generally appears higher in later career bands than in early career bands, consistent with a potential accumulation of occupational stress over time, although the pattern is not perfectly smooth because each bin contains a limited number of observations. Across physical activity, prevalence tends to be lower at higher activity levels, suggesting a protective association, but with some irregular ups and downs across adjacent bins that likely reflect sampling variability and the arbitrariness of cut points. We observe that these visual risk gradients provide preliminary evidence for the directionality of associations we expect to test formally.

2.4 Correlation Matrix using corrplot

Code
library(corrplot)
# Select continuous variables
continuous_vars <- data %>%
  select(years_working, phys_activity)

# Compute correlation matrix
cor_matrix <- cor(continuous_vars, use = "complete.obs")

# Plot correlation matrix
corrplot(cor_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45,
         addCoef.col = "black", number.cex = 0.7,
         title = "Correlation Matrix of Continuous Variables",
         mar = c(0,0,1,0))
Figure 8: Correlation Matrix of Continuous Variables

Comment: The correlation between years working and physical activity is very small (r ≈ -0.04), reinforcing that the predictors contribute distinct information and reducing concern about multicollinearity ahead of the formal VIF check.

2.5 Descriptive Statistics by Depression Status

Code
library(gtsummary)
tbl1 <- data |> 
  select(depression, years_working, phys_activity, obesity) |> 
  tbl_summary(by = depression,
              statistic = list(
                all_continuous() ~ "{mean} ({sd})",
                all_categorical() ~ "{n} ({p}%)"
              ),
              digits = all_continuous() ~ 2,
              missing = "no") |>
  add_overall() |>
  modify_header(label ~ "**Variable**") |>
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Depression Status**") |>
  bold_labels() |> 
  as_gt() |> 
  tab_options(
    table.width = pct(100)
  ) |> 
  tab_style(
  style = cell_text(align = "left"),
  locations = cells_footnotes()
  )
tbl1
Table 2: Descriptive Statistics by Depression Status
Variable Overall
N = 2001
Depression Status
Not Depressed
N = 1211
Depressed
N = 791
Years Working 19.11 (11.13) 17.09 (10.90) 22.21 (10.82)
Physical Activity Score 5.26 (2.90) 5.96 (2.82) 4.18 (2.71)
Obesity Status


    Not Obese 120 (60%) 82 (68%) 38 (48%)
    Obese 80 (40%) 39 (32%) 41 (52%)
1 Mean (SD); n (%)

Interpretation: Compared with the non-depressed group, the depressed group tends to report fewer physical activity points and slightly longer years working, while obesity is more prevalent. These unadjusted contrasts motivate multivariable modeling to separate each predictor’s independent contribution.

3 Part C: Regression Analysis

This section presents the core statistical modeling to quantify associations between depression and the hypothesized risk factors. We begin with univariable logistic regression models to establish crude associations, then proceed to multivariable models that adjust for confounding and test for effect modification. Model comparison techniques identify the specification that best balances parsimony and explanatory power while maintaining interpretability.

3.1 Change depression to numeric 0 and 1 for regression analysis

Code
data <- data |> 
    mutate(depression = as.numeric(depression) - 1)
glimpse(data)
Rows: 200
Columns: 4
$ depression    <dbl> 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1,…
$ years_working <dbl> 3.567682, 29.181835, 21.448700, 38.122078, 18.223070, 34…
$ phys_activity <dbl> 2.0521387, 3.8608661, 5.3713262, 6.0407881, 9.0055039, 7…
$ obesity       <fct> Obese, Obese, Not Obese, Obese, Not Obese, Not Obese, Ob…

3.2 Simple Logistic Regression Models

Univariate logistic regression models estimate the crude association between each predictor and depression without adjustment for other covariates. These models establish baseline effect estimates and significance levels that are compared against adjusted estimates to assess confounding and identify predictors warranting inclusion in the multivariable model.

Code
uvtable <- data |>
  tbl_uvregression(
    method = glm,
    y = "depression",
    exponentiate = TRUE,
    conf.level = 0.95,
    pvalue_fun = ~format.pval(.x, digits = 2, eps = 0.001),
    hide_n = TRUE
  ) |>
  modify_header(label ~ "**Variable**",
  estimate ~ "**OR**") |>
  bold_labels() |> 
  as_gt() |>
  tab_options(
    table.width = pct(100)
  ) |> 
  tab_style(
  style = cell_text(align = "left"),
  locations = cells_footnotes()
  )
uvtable
Table 3: Simple Logistic Regression Results for Each Covariate
Variable OR 95% CI p-value
Years Working 1.01 1.00, 1.02 0.0013
Physical Activity Score 0.95 0.93, 0.97 <0.001
Obesity Status


    Not Obese
    Obese 1.22 1.06, 1.39 0.0053
Abbreviation: CI = Confidence Interval

Interpretation: In univariable models, years working increases the odds of depression (OR ≈ 1.01 per year, p < 0.001) and physical activity is protective (OR ≈ 0.95 per unit, p < 0.001). Obese participants have higher unadjusted odds than non-obese participants (OR ≈ 1.22, p = 0.005). These crude associations set expectations for the multivariable models.

3.3 Multiple Logistic Regression Model

Multivariable logistic regression isolates the independent contribution of each predictor by simultaneously adjusting for all other covariates in the model. We first fit a main-effects model to estimate adjusted associations, then incorporate an interaction term to test whether the effect of physical activity varies by obesity status, reflecting theoretical expectations of effect modification.

3.3.1 Model 1: Main Effects Only

Code
model1 <- glm(depression ~ years_working + phys_activity + obesity,
              data = data,
              family = binomial(link = "logit"))
tidy(model1, exponentiate = TRUE, conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))
Table 4: Model 1: Multivariable Logistic Regression Results (Main Effects Only)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.5804090 0.4302545 -1.264420 0.2061 0.2459216 1.3393287
years_working 1.0531917 0.0149993 3.455169 <0.001 1.0234057 1.0856429
phys_activity 0.7788315 0.0579217 -4.315491 <0.001 0.6925121 0.8697786
obesityObese 2.4048989 0.3247027 2.702497 0.0069 1.2796631 4.5876165

Interpretation: After adjustment for the other covariates, years working remains positively associated with depression, physical activity remains protective, and obesity is associated with higher odds (all p < 0.01). This main-effects model provides a strong baseline before testing effect modification.

3.3.2 Model 2: Main Effects + Interaction

Code
model2 <- glm(depression ~ years_working + phys_activity + obesity + phys_activity:obesity,
              data = data,
              family = binomial(link = "logit"))
tidy(model2, exponentiate = TRUE, conf.int = TRUE) |>
  mutate(p.value = format.pval(p.value, digits = 2, eps = 0.001))
Table 5: Model 2: Multivariable Logistic Regression Results (With Interaction)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.9352352 0.5406653 1.221142 0.222 0.6835716 5.7896743
years_working 1.0593548 0.0161483 3.570660 <0.001 1.0273407 1.0947962
phys_activity 0.5829386 0.1024914 -5.265551 <0.001 0.4692610 0.7031637
obesityObese 0.1913267 0.6828885 -2.421732 0.015 0.0482501 0.7117911
phys_activity:obesityObese 1.6929595 0.1279853 4.113584 <0.001 1.3287480 2.2005401

Interpretation: The interaction term is significant, indicating that the protective slope of physical activity differs by obesity status. As a result, the obesity main effect now represents the contrast at physical activity = 0, and the combined effect of activity for obese participants should be interpreted using the interaction.

3.3.3 Model Comparison

3.3.3.1 Likelihood Ratio Test

Code
anova(model1, model2, test = "Chisq")
Table 6: Likelihood Ratio Test Comparing Model 1 and Model 2
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
196 229.5084 NA NA NA
195 210.2716 1 19.23678 1.15e-05

Interpretation: The likelihood ratio test (p < 0.001) (Kamarul Imran, Wan Nor Arifin, and Tengku Muhammad Hanis Tengku Mokhtar 2024) shows that adding the interaction meaningfully reduces model deviance, so the differential effect of physical activity by obesity status improves fit beyond additive main effects when explaining depression risk .

3.3.3.2 Using performance package

Code
library(performance)
compare_performance(model1, model2, rank = TRUE)
Table 7: Model Performance Comparison Between Model 1 and Model 2
Name Model R2_Tjur RMSE Sigma Log_loss Score_log Score_spherical PCP AIC_wt AICc_wt BIC_wt Performance_Score
model2 glm 0.2589601 0.4204117 1 0.5256790 -46.85270 0.0170944 0.6458200 0.9998193 0.9998096 0.9990605 0.7777778
model1 glm 0.1813535 0.4428510 1 0.5737709 -44.13219 0.0069978 0.6087279 0.0001807 0.0001904 0.0009395 0.2222222

Interpretation: The performance comparison confirms the superiority of Model 2 across multiple criteria. The interaction model posts lower AIC/BIC and higher pseudo-R², indicating better out-of-sample parsimony despite the added term. The slight reduction in RMSE and log-loss reinforces that allowing obesity to modify the physical activity slope yields more accurate probability estimates (Lüdecke et al. 2021).

3.3.4 Preliminary final model

Code
var_labels <- c(
    years_working = "Years Working (per year)",
    phys_activity = "Physical Activity Score (per unit)",
    obesityObese = "Obesity Status: Obese vs Not Obese",
    `phys_activity:obesityObese` = "Physical Activity * Obesity"
)

prelim_tbl <- tbl_regression(
    model2,
    intercept = TRUE,
    exponentiate = TRUE,
    conf.level = 0.95,
    pvalue_fun = ~style_pvalue(.x, digits = 3, eps = 0.001),
    estimate_fun = ~style_number(.x, digits = 2)
)
prelim_tbl
Table 8: Preliminary Final Model
Characteristic OR 95% CI p-value
(Intercept) 1.94 0.68, 5.79 0.222
Years Working 1.06 1.03, 1.09 <0.001
Physical Activity Score 0.58 0.47, 0.70 <0.001
Obesity Status


    Not Obese
    Obese 0.19 0.05, 0.71 0.015
Physical Activity Score * Obesity Status


    Physical Activity Score * Obese 1.69 1.33, 2.20 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation: The preliminary final model indicates that years working remains a risk factor after adjustment, whereas physical activity is protective among non-obese participants (the reference group). The interaction term clarifies that this protective gradient is not uniform: for obese participants, the net activity effect is the sum of the main activity coefficient and the interaction coefficient, so the odds ratio for activity in the obese group is exp(β_activity + β_interaction) rather than exp(β_activity) alone. Consequently, the obesity main effect should be read as the contrast at physical activity = 0, not an overall average across activity levels, which avoids a common misinterpretation.

4 Part D: Model Checking

Rigorous model diagnostics assess whether the fitted logistic regression satisfies key statistical assumptions and identify influential observations that may disproportionately affect parameter estimates. This section systematically evaluates linearity in the logit, independence of observations, absence of multicollinearity, and the presence of outliers or high-leverage points. Remedial measures are applied where necessary to ensure robust inference and valid interpretation of regression coefficients.

4.1 Assumption 1: Linearity in the Logit

The linearity assumption requires that continuous predictors have a linear relationship with the log-odds of the outcome. We assess this using the Box-Tidwell test (Shrestha 2019), which adds interaction terms between continuous predictors and their natural logarithms.

4.1.1 Box-Tidwell Test

Code
# Box-Tidwell test for linearity in the logit
# Add log-transformed predictors (avoiding log of zero/negative values)
data_linearity <- data |>
  mutate(
    years_working_log = log(years_working + 0.1),  # Add small constant to avoid log(0)
    phys_activity_log = log(phys_activity + 0.1)
  )

# Model with interaction between predictors and their logs
model_boxTidwell <- glm(
  depression ~ years_working + phys_activity + obesity + 
    phys_activity:obesity +
    years_working:years_working_log + 
    phys_activity:phys_activity_log,
  data = data_linearity,
  family = binomial(link = "logit")
)

# Test significance of interaction terms
tidy(model_boxTidwell, conf.int = TRUE) |>
  filter(grepl("_log", term)) |>
  mutate(p.value = format.pval(p.value, digits = 3, eps = 0.001)) |>
  select(term, estimate, std.error, statistic, p.value) |>
  kable(caption = "Box-Tidwell Test for Linearity")
Box-Tidwell Test for Linearity
term estimate std.error statistic p.value
years_working:years_working_log 0.0076509 0.0452454 0.1690979 0.866
phys_activity:phys_activity_log -0.2764550 0.1745105 -1.5841744 0.113

Interpretation: The Box-Tidwell terms are non-significant (p = 0.866 for years_working and p = 0.113 for phys_activity), supporting a linear logit for both continuous predictors without additional transformations.

4.1.2 Using mfp package

The mfp package provides multivariable fractional polynomial (MFP) modeling, which automatically tests whether continuous predictors require transformation to achieve linearity in the logit (Ambler and Benner 2025). If the default linear term (power = 1) is retained, this supports the linearity assumption.

Code
library(mfp)

# Fit MFP model to test for non-linear transformations
mfp_model <- mfp(depression ~ fp(years_working) + fp(phys_activity) + obesity + phys_activity:obesity,
                 data = data,
                 family = binomial(link = "logit"),
                 verbose = T)

    Variable    Deviance    Power(s)
------------------------------------------------
Cycle 1
    obesityNot Obese:phys_activity           
                210.272      
                210.272     1
                                
                                

    obesityObese:phys_activity           
                210.272      
                210.272     1
                                
                                

    years_working            
                224.411      
                210.272     1
                210.272     1
                207.43      0 0.5

    obesityObese             
                216.396      
                210.272     1
                                
                                

    phys_activity            
                210.272      
                210.272     1
                205.23      3
                200.061     0.5 0.5

Cycle 2
    obesityNot Obese:phys_activity           
                209.573      
                205.23      1
                                
                                

    obesityObese:phys_activity           
                208.857      
                205.23      1
                                
                                

    years_working            
                219.337      
                205.23      1
                205.23      1
                203.357     0 0.5

    obesityObese             
                213.114      
                205.23      1
                                
                                


Tansformation
                               shift scale
obesityNot Obese:phys_activity     0     1
obesityObese:phys_activity         0     1
years_working                      0    10
obesityObese                       0     1
phys_activity                      0    10

Fractional polynomials
                               df.initial select alpha df.final power1 power2
obesityNot Obese:phys_activity          1      1  0.05        1      1      .
obesityObese:phys_activity              1      1  0.05        1      1      .
years_working                           4      1  0.05        1      1      .
obesityObese                            1      1  0.05        1      1      .
phys_activity                           4      1  0.05        2      3      .


Transformations of covariates:
                                      formula
years_working         I((years_working/10)^1)
phys_activity         I((phys_activity/10)^3)
obesity                               obesity
obesity:phys_activity                    <NA>


Deviance table:
         Resid. Dev
Null model   268.3729
Linear model     210.2716
Final model  205.23
Code
# Display the selected transformations
summary(mfp_model)

Call:
glm(formula = depression ~ obesity + I((years_working/10)^1) + 
    I((phys_activity/10)^3), family = binomial(link = "logit"), 
    data = data)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -1.1305     0.3781  -2.990 0.002791 ** 
obesityObese              0.9264     0.3263   2.839 0.004519 ** 
I((years_working/10)^1)   0.5037     0.1480   3.403 0.000666 ***
I((phys_activity/10)^3)  -2.7344     0.6585  -4.153 3.29e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 268.37  on 199  degrees of freedom
Residual deviance: 228.98  on 196  degrees of freedom
AIC: 236.98

Number of Fisher Scoring iterations: 4

Interpretation: The MFP algorithm selected different transformations for the two continuous predictors: years_working retained the linear form (power = 1), confirming a linear relationship with the log-odds, while phys_activity was transformed with a cubic term (power = 3), suggesting a non-linear relationship. The final MFP model (deviance = 205.23) shows modest improvement over the linear model (deviance = 210.27), reducing deviance by approximately 5 points. However, given that the Box-Tidwell test (above) found the linear terms acceptable and the improvement is modest, we prioritize model parsimony and interpretability by retaining the linear specification for both predictors in our final model.

4.1.3 Visual Assessment: Logit Plots

We plot empirical logit values against each continuous predictor with smoothed trend lines to detect potential non-linearities, threshold effects, or systematic curvature that would contradict the linearity assumption.

Code
# Get predicted probabilities
probabilities <- predict(model2, type = "response")

# Calculate logit values
logit_values <- log(probabilities/(1-probabilities))

# Select only continuous predictors
data_cont <- data %>% 
  select_if(is.numeric)

# Bind logit values to continuous predictors
predictors <- cbind(data_cont, logit_values)

# Create scatter plots
par(mfrow = c(1, 2))
library(ggplot2)
ggplot(predictors, aes(x = years_working, y = logit_values)) +
  geom_point() +
  geom_smooth(method = "loess")

Code
library(ggplot2)
ggplot(predictors, aes(x = phys_activity, y = logit_values)) +
  geom_point() +
  geom_smooth(method = "loess")

Interpretation: The smoothed trends are approximately linear for both predictors, with no clear curvature or threshold effects (Harrell , 2015). This visual assessment is consistent with the Box-Tidwell results and supports retaining the original scales.

4.2 Assumption 2: Independence of Observations

Independence is primarily ensured by the study design: each participant contributes a single, unrelated observation, and there is no clustering, repeated measures, or temporal ordering that would induce serial correlation. We then complement this design-based justification with residuals vs observation order and the Durbin-Watson test as a formal check for autocorrelation (Durbin and Watson 1950) .

4.2.1 Visual diagnostics

Code
plot(residuals(model2), type = "o")
abline(h = 0, col = "red", lty = 2)
Figure 9: Residuals Over Observation Order

Interpretation: No obvious patterns or trends in the plot, suggesting that the residuals are independent.

4.2.2 Durbin-Watson Test

Code
library(lmtest)
dwtest(model2)

    Durbin-Watson test

data:  model2
DW = 1.849, p-value = 0.1421
alternative hypothesis: true autocorrelation is greater than 0

Interpretation: The Durbin-Watson statistic is 1.85 (p = 0.142), indicating no significant autocorrelation in the residuals and supporting the assumption of independent observations.

4.3 Assumption 3: Absence of Multicollinearity

Code
library(car)

# Calculate VIF for main effects model (Model 1)
vif_values <- vif(model1)

# Create a formatted table
vif_table <- data.frame(
  Variable = names(vif_values),
  VIF = round(vif_values, 2),
  Tolerance = round(1 / vif_values, 3)
) 

kable(vif_table, align = "lccc")
Table 9: Variance Inflation Factor (VIF) for Model Predictors
Variable VIF Tolerance
years_working years_working 1.04 0.958
phys_activity phys_activity 1.04 0.958
obesity obesity 1.01 0.991

Interpretation: All VIF values are near 1.0 (max ≈ 1.04), indicating virtually no multicollinearity among years working, physical activity, and obesity. This supports stable, well-estimated coefficients (Fox and Weisberg 2019).

4.4 Checking for Influential Observations

4.4.1 Generate predicted values and residuals

data.fitted <- augment(model2, type.predict = "response")

4.4.2 Diagnostic Plots

Code
par(mfrow = c(2, 2))
plot(model2)
Figure 10: Diagnostic Plots for Logistic Regression Model

Interpretation: Across the four-panel diagnostics, the residuals-versus-fitted plot shows a roughly horizontal band without strong curvature, suggesting the logit link is capturing the main functional form. The normal Q–Q plot of deviance residuals stays close to the reference line with only mild tail deviations, which is acceptable for binary outcomes and does not indicate severe lack of fit. The scale-location plot shows relatively constant spread across fitted values, implying no clear heteroscedastic pattern in the residuals. Finally, the residuals-versus-leverage panel shows only a few moderate-leverage cases, none approaching Cook’s D = 1, which supports stable estimates while justifying the targeted influence checks that follow (Fox and Weisberg 2019).

4.4.3 Cook’s distance

We use Cook’s distance to identify influential observations that may disproportionately affect the model’s estimates. The threshold for Cook’s distance is commonly set at 4/n, where n is the number of observations in the dataset.

Code
cutoff <- 4/(nrow(data))
plot(model2, which = 4, cook.levels = cutoff)
Figure 11: Cook’s Distance for Influential Observations

Interpretation: Observations 62, 97, and 84 exceed the 4/n heuristic (all Cook’s D < 0.1), so they merit inspection even though none approach the more conservative Cook’s D = 1 threshold. We examine these three points below.

Code
data[c(62, 84, 97), ]
Table 10: Details of Top Three Influential Observations Based on Cook’s Distance
depression years_working phys_activity obesity
62 1 1.035911 8.8912964 Not Obese
84 0 8.234158 0.1594874 Not Obese
97 0 39.848956 1.4748497 Not Obese

These data must be checked for data entry errors or other issues. If no issues are found, we may consider refitting the model without these points to assess their impact.

4.4.4 Residuals vs Leverage plot

Code
plot(model2, which = 5)
Figure 12: Residuals vs Leverage Plot

Interpretation: The Residuals vs. Leverage plot highlights the same cases (IDs 62, 84, and 97) with moderate leverage, all well below Cook’s D = 1. They warrant review but are unlikely to dominate the model once sensitivity analyses are performed.

4.4.5 Using DFBETAS

We use threshold of 2/sqrt(n) to identify influential observations for each predictor.

Code
# Compute DFBETAS and display as a datatable
library(DT)
dfb <- dfbetas(model2)
dfb_long <- dfb |> 
  as.data.frame() |> 
  mutate(id = row_number()) |> 
  pivot_longer(
    cols = -id,
    names_to = "term",
    values_to = "dfbeta"
  )

# Cutoff
dfb_threshold <- 2 / sqrt(nrow(data))

# Identify influential observations
influential_dfb <- dfb_long |> 
  filter(abs(dfbeta) > dfb_threshold) |>
  arrange(desc(abs(dfbeta)))
influential_dfb %>%
  datatable()
Table 11: DFBETAS for Each Observation and Predictor (Descending Order)

Interpretation: DFBETAS indicate observation 84 exerts the largest influence across multiple coefficients (|DFBETAS| ≈ 0.28–0.35), while observations 97 and 154 primarily affect the years_working and phys_activity slopes. Although these values exceed the 2/√n screening threshold, all remain well below 1, implying that no single case overturns coefficient directionality; nevertheless, they justify the downstream refit excluding high-residual, high-leverage points.

4.4.6 Using standardized residuals and leverage values cutoff

We will identify data points with standardized residuals greater than 2 or less than -2, as well as those with leverage values exceeding the calculated cutoff. Leverage cutoff is calculated as 2*(p+1)/n, where p is the number of predictors and n is the number of observations.

Code
leverage_cutoff <- 2 * (length(coef(model2)) + 1) / nrow(data)
data.fitted %>% 
  mutate(id = row_number()) %>%
    relocate(id, .before = everything()) %>%
  filter(.std.resid > 2 | .std.resid < -2 | .hat > leverage_cutoff) %>%
  datatable()
Table 12: Outliers Identified by Standardized Residuals and Leverage

We have identified 3 potential outliers (IDs 9, 62, and 97) based on standardized residuals and leverage; two of them overlap with the Cook’s distance flags, which supports a focused sensitivity check.

4.5 Remedial measures

Having identified potentially influential observations through multiple diagnostic criteria, we now implement remedial measures by refitting the model after excluding these cases. Comparing coefficient estimates, significance levels, and model fit statistics between the original and reduced datasets quantifies the extent to which outliers drive the observed associations and determines the robustness of our substantive conclusions.

4.5.1 Remove outliers and refit model

Code
data.nooutliers <- data.fitted %>%
  filter(.std.resid < 2 & .std.resid > -2 & .hat < leverage_cutoff)

model_removed <- glm(depression ~ years_working + phys_activity + obesity + phys_activity:obesity,
              data = data.nooutliers,
              family = binomial(link = "logit"))

tidy(model_removed, conf.int = TRUE) |> mutate(p.value = format.pval(p.value, digits = 3, eps = 0.001))
glance(model_removed)
Table 13: Refitted Model After Removing Outliers
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.7547721 0.5778630 1.306144 0.19150 -0.3499894 1.9356995
years_working 0.0768986 0.0179673 4.279915 < 0.001 0.0431287 0.1139254
phys_activity -0.6656821 0.1194507 -5.572862 < 0.001 -0.9206441 -0.4493722
obesityObese -2.0945113 0.7342211 -2.852698 0.00433 -3.5899548 -0.6945036
phys_activity:obesityObese 0.6478278 0.1426781 4.540486 < 0.001 0.3810161 0.9437453
null.deviance df.null logLik AIC BIC deviance df.residual nobs
263.6382 196 -95.75465 201.5093 217.9253 191.5093 192 197

Interpretation: The refitted model preserves the same effect directions as the preliminary model, and all predictors remain statistically significant. This indicates the substantive conclusions are not driven by a small number of influential observations.

4.5.2 Compare coefficient changes after removing outliers

Code
coef_comparison <- tidy(model2) %>%
  select(term, estimate) %>%
  rename(estimate_prelim = estimate) %>%
  left_join(
    tidy(model_removed) %>%
      select(term, estimate) %>%
      rename(estimate_refit = estimate),
    by = "term"
  ) %>%
  mutate(
    change_in_estimate = estimate_refit - estimate_prelim,
    percent_change = (change_in_estimate / estimate_prelim) * 100
  )
coef_comparison
Table 14: Changes in Coefficients After Removing Outliers
term estimate_prelim estimate_refit change_in_estimate percent_change
(Intercept) 0.6602289 0.7547721 0.0945432 14.31976
years_working 0.0576601 0.0768986 0.0192385 33.36541
phys_activity -0.5396735 -0.6656821 -0.1260086 23.34905
obesityObese -1.6537727 -2.0945113 -0.4407385 26.65049
phys_activity:obesityObese 0.5264782 0.6478278 0.1213496 23.04931

Interpretation: After removing the three flagged observations, the coefficient estimates shift modestly but preserve the same direction and statistical significance. The largest relative movement appears in the obesity and interaction terms, which is expected because these effects are most sensitive to extreme subgroup patterns. Importantly, confidence intervals remain on the same side of the null, indicating that the substantive conclusions are robust even though point estimates adjust slightly.

4.5.3 Compare model fitness between preliminary final model with model after removing outliers

  • Preliminary model
Code
model_performance(model2)
Table 15: Performance Metrics for Preliminary Model
AIC AICc BIC R2_Tjur RMSE Sigma Log_loss Score_log Score_spherical PCP
220.2716 220.5809 236.7632 0.2589601 0.4204117 1 0.525679 -46.8527 0.0170944 0.64582
  • Refitted model after outlier removal
Code
model_performance(model_removed)
Table 16: Performance Metrics for Refitted Model
AIC AICc BIC R2_Tjur RMSE Sigma Log_loss Score_log Score_spherical PCP
201.5093 201.8234 217.9253 0.3101762 0.4079531 1 0.4860642 -48.32047 0.0222623 0.6715209
  • Using compare_performance function
Code
compare_performance(model2, model_removed, rank = TRUE)
Table 17: Side-by-Side Performance Comparison of Models
Name Model R2_Tjur RMSE Sigma Log_loss Score_log Score_spherical PCP AIC_wt AICc_wt BIC_wt Performance_Score
model_removed glm 0.3101762 0.4079531 1 0.4860642 -48.32047 0.0222623 0.6715209 0.9999157 0.9999155 0.9999188 0.7777778
model2 glm 0.2589601 0.4204117 1 0.5256790 -46.85270 0.0170944 0.6458200 0.0000843 0.0000845 0.0000812 0.2222222

Interpretation: After removing the three flagged cases, AIC drops to ~201.5 and pseudo-R² improves while all covariates remain significant, indicating the outliers added noise rather than signal and that the primary associations are robust.

4.6 Check Model Fitness

Model fitness assessment evaluates both discrimination (the model’s ability to distinguish between depressed and non-depressed individuals) and calibration (the agreement between predicted probabilities and observed outcomes). We employ complementary metrics including classification accuracy, the Hosmer-Lemeshow goodness-of-fit test, and area under the receiver operating characteristic curve to comprehensively characterize predictive performance and validate the final model specification.

4.6.1 Confusion Matrix and Classification Metrics

We evaluate the model’s predictive performance using a confusion matrix and calculate key classification metrics including accuracy, sensitivity, and specificity.

Code
library(caret)
# Generate predicted probabilities
pred_prob <- predict(model_removed, type = "response")
# Convert to binary predictions using 0.5 threshold
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Create confusion matrix
cm <- confusionMatrix(factor(pred_class, levels = c(0, 1)),
                      factor(data.nooutliers$depression, levels = c(0, 1)),
                      positive = "1")
cm
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 95 28
         1 25 49
                                          
               Accuracy : 0.731           
                 95% CI : (0.6633, 0.7915)
    No Information Rate : 0.6091          
    P-Value [Acc > NIR] : 0.0002244       
                                          
                  Kappa : 0.431           
                                          
 Mcnemar's Test P-Value : 0.7835305       
                                          
            Sensitivity : 0.6364          
            Specificity : 0.7917          
         Pos Pred Value : 0.6622          
         Neg Pred Value : 0.7724          
             Prevalence : 0.3909          
         Detection Rate : 0.2487          
   Detection Prevalence : 0.3756          
      Balanced Accuracy : 0.7140          
                                          
       'Positive' Class : 1               
                                          

Interpretation: The refitted model correctly classifies 73.1% of cases, with balanced specificity (0.79) and sensitivity (0.64). This indicates acceptable discrimination without sacrificing the ability to identify depressed individuals.

4.6.2 Hosmer-Lemeshow Goodness-of-Fit Test

The Hosmer-Lemeshow test assesses whether the observed event rates match the expected event rates across deciles of predicted probabilities. A non-significant result (p > 0.05) indicates good model fit.

Code
library(ResourceSelection)
# Perform Hosmer-Lemeshow test
hl_test <- hoslem.test(data.nooutliers$depression, pred_prob, g = 10)
hl_test

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  data.nooutliers$depression, pred_prob
X-squared = 6.0923, df = 8, p-value = 0.6369

Interpretation: The Hosmer-Lemeshow test yields p = 0.637, indicating no evidence of lack of fit; observed and expected depression counts align across deciles of predicted risk.

4.6.3 Area under Receiver Operating Characteristic (ROC) Curve

The ROC curve plots sensitivity against (1 - specificity) across all possible classification thresholds. The Area Under the Curve (AUC) quantifies overall model discrimination ability.

Code
library(pROC)

# Create ROC object
roc_obj <- roc(data.nooutliers$depression, pred_prob)
auc_value <- auc(roc_obj)
ci_auc <- ci.auc(roc_obj)

# Plot ROC curve
roc_model <- roc(data.nooutliers$depression, pred_prob, 
plot = TRUE, print.auc = TRUE, ci = TRUE, legacy.axes = TRUE)
Figure 13: ROC Curve for final model

Interpretation: The ROC analysis shows strong discrimination (AUC = 0.823, 95% CI: 0.767–0.880), indicating the model reliably separates depressed from non-depressed participants across thresholds.

4.7 Model Performance Summary

The final model demonstrates excellent discriminative ability (AUC = 82.3%, 95% CI: 76.7%-88.0%) and good calibration (Hosmer-Lemeshow p = 0.637), with 73.1% overall classification accuracy. Removing the flagged observations improved fit (AIC: 220.3 → 201.5) without changing effect directions, confirming robustness. The absence of multicollinearity (all VIF < 1.05) ensures stable coefficient estimates and reliable inference.

5 Part E: Results Presentation

This section synthesizes the findings from diagnostic-validated models into a comprehensive presentation of final parameter estimates, effect interpretations, and practical implications. We present the regression equation, adjusted odds ratios with confidence intervals, and visualizations that elucidate the interaction between physical activity and obesity status. Model-based predictions across the covariate space illustrate absolute risk differences and support clinical translation of the statistical findings.

5.1 Final model

Code
library(gt)
finalmvtable <- tbl_regression(
  model_removed,
  exponentiate = FALSE,
  pvalue_fun = label_style_pvalue(digits = 3)) %>%
  bold_labels() %>%
  modify_table_body(
  ~ .x %>%
  mutate(
    B = ifelse(row_type == "level" & is.na(estimate), "Ref.", round(estimate, 3)), 
    SE = ifelse(row_type == "level" & is.na(std.error), "", round(std.error, 3)),
    expB = ifelse(row_type == "level" & is.na(estimate), "Ref.", round(exp(estimate), 2)),
    wald_stat = ifelse(row_type == "level" & is.na(statistic), "", round(statistic^2, 3)),
    exp.cilow = ifelse(row_type == "label" | row_type == "level", round(exp(conf.low), 2), ""), 
    exp.cihi = ifelse(row_type == "label"| row_type == "level", round(exp(conf.high), 2), "") 
  ) %>%
  relocate(B, .before = estimate) %>%
  relocate(SE, .after = B) %>%
  relocate(expB, .after = SE) %>%
  relocate(wald_stat, .before = p.value)
  ) %>%
  modify_column_hide(columns = c("estimate", "std.error", "statistic", "exp.cilow", "exp.cihi", "conf.low", "conf.high")) %>%
  modify_column_unhide(columns = c(B, SE, expB, wald_stat)) %>%
  modify_header(label = "**Variables**", B = "**B**", SE = "**SE**", expB = "**Adj. OR (95%CI)**", wald_stat = "**Wald stat**") %>%
  modify_table_styling(
  column = expB,
  rows = !is.na(estimate),
  cols_merge_pattern = "{expB} ({exp.cilow}, {exp.cihi})"
  ) %>%
  as_gt() %>%
  tab_footnote(
  footnote = paste0("Constant = ", round(coef(model_removed)[1], 3))
  ) %>%
  tab_footnote(
  footnote = "No multicollinearity"
  ) %>%
  tab_footnote(
  footnote = paste0("Hosmer-Lemeshow test, p-value = ", round(hl_test$p.value, 3))
  ) %>%
  tab_footnote(
  footnote = paste0("Classification table accuracy = ", round(cm$overall["Accuracy"]*100, 2), "%") 
  ) %>%
  tab_footnote(
  footnote = paste0("Area Under Receiver Operating Characteristics (ROC) was ", round(auc_value * 100, 1), "%")
  ) %>%
  tab_footnote(
  footnote = paste0("AIC = ", round(AIC(model_removed), 2))
  ) %>%
  tab_footnote(
  footnote = paste0("Nagelkerke's R² = ", round(performance::r2_nagelkerke(model_removed), 3))
  ) %>%
  opt_align_table_header(align = "left") %>%
  tab_options(
  table_body.hlines.style = "none",
  table.width = pct(100)
  ) %>%
  tab_style(
  style = cell_text(align = "right"),
  locations = cells_body(columns = -label)
  ) %>%
  tab_style(
  style = cell_text(align = "left"),
  locations = cells_footnotes()
  )
finalmvtable
Table 18: Final Multivariable Logistic Regression Model for Depression
Variables B SE Adj. OR (95%CI) Wald stat p-value
Years Working 0.077 0.018 1.08 (1.04, 1.12) 18.318 <0.001
Physical Activity Score -0.666 0.119 0.51 (0.40, 0.64) 31.057 <0.001
Obesity Status




    Not Obese Ref. Ref.
    Obese -2.095 0.734 0.12 (0.03, 0.50) 8.138 0.004
Physical Activity Score * Obesity Status




    Physical Activity Score * Obese 0.648 0.143 1.91 (1.46, 2.57) 20.616 <0.001
Constant = 0.755
No multicollinearity
Hosmer-Lemeshow test, p-value = 0.637
Classification table accuracy = 73.1%
Area Under Receiver Operating Characteristics (ROC) was 82.3%
AIC = 201.51
Nagelkerke's R² = 0.416
Abbreviation: CI = Confidence Interval

5.2 Final model equation

5.2.1 Odds

\[ \begin{aligned} \text{Odds}(\text{Depression}) = \frac{p}{1-p} &= \exp[0.755 + 0.077 \times \text{Years Working} \\ &\quad - 0.666 \times \text{Physical Activity} - 2.095 \times \text{Obese} \\ &\quad + 0.648 \times (\text{Physical Activity} \times \text{Obese})] \end{aligned} \]

5.2.2 Probability

\[ P(\text{Depression}=1)=\frac{\exp\!\left(\begin{aligned} &0.755+0.077\times\text{Years Working}\\ &-0.666\times\text{Physical Activity}\\ &-2.095\times\text{Obese}\\ &+0.648\times(\text{Physical Activity}\times\text{Obese}) \end{aligned}\right)}{1+\exp\!\left(\begin{aligned} &0.755+0.077\times\text{Years Working}\\ &-0.666\times\text{Physical Activity}\\ &-2.095\times\text{Obese}\\ &+0.648\times(\text{Physical Activity}\times\text{Obese}) \end{aligned}\right)} \]

5.3 Interpretation of Final Model

5.3.1 Main Effects

  • Years Working (β = 0.077, p < 0.001, OR = 1.08, 95% CI: 1.04-1.12)

    • For every additional year of working, the odds of depression increase by 8% (OR = 1.08), holding all other variables constant. This positive association suggests that prolonged occupational exposure represents a cumulative risk factor for depression, consistent with theories of occupational stress and burnout. The narrow confidence interval (1.04-1.12) indicates precise estimation of this effect. In practical terms, an individual who has worked for 30 years has approximately 2.4 times higher odds of depression compared to someone who has worked for 10 years (1.08^20 = 2.44), assuming all other factors remain equal.
  • Physical Activity Score (β = -0.666, p < 0.001, OR = 0.51, 95% CI: 0.40-0.64)

    • Among non-obese individuals (reference category), each one-unit increase in physical activity score is associated with a 49% reduction in the odds of depression (OR = 0.51). This strong protective effect aligns with extensive epidemiological evidence linking physical activity to improved mental health outcomes. The narrow confidence interval (0.40-0.64) excludes one, confirming the robustness of this protective association.
  • Obesity Status (β = -2.095, p = 0.004, OR = 0.12, 95% CI: 0.03-0.50)

    • The main effect of obesity represents the odds ratio when physical activity equals zero. At this reference point, obese individuals have 88% lower odds of depression compared to non-obese individuals (OR = 0.12). However, this interpretation must be considered alongside the significant interaction term, as the effect of obesity varies across different levels of physical activity. The negative coefficient appears counterintuitive in isolation but becomes meaningful when examining the interaction effect.

5.3.2 Interaction Effect

Physical Activity × Obesity (β = 0.648, p < 0.001, OR = 1.91, 95% CI: 1.46-2.57)

The positive interaction coefficient (β = 0.648) indicates that the protective effect of physical activity is significantly attenuated among obese individuals compared to non-obese individuals. Specifically, for obese individuals, the net effect of physical activity on depression is:

  • Combined coefficient: -0.666 + 0.648 = -0.018
  • Combined OR: exp(-0.018) = 0.98

This reveals that among obese individuals, each one-unit increase in physical activity is associated with only a 2% reduction in the odds of depression (OR = 0.98), compared to the 49% reduction observed in non-obese individuals (OR = 0.51). The interaction OR of 1.91 indicates that the slope of the physical activity-depression relationship differs by a factor of 1.91 between obesity groups.

Possible Explanation: This interaction may reflect physiological constraints whereby obesity-related metabolic dysfunction (chronic inflammation, insulin resistance, altered adipokine profiles) diminishes the neurobiological benefits of physical activity. Alternatively, obese individuals may experience reduced exercise intensity or frequency due to physical limitations, thereby limiting the dose-response relationship between activity and mental health outcomes.

Biological and Social Plausibility of the Interaction: The inclusion of this interaction term is justified on both physiological and behavioral grounds. Biologically, obesity is characterized by chronic low-grade inflammation, insulin resistance, and dysregulated adipokine signaling which are metabolic disturbances that may dampen the neurobiological benefits typically conferred by physical activity, such as enhanced neuroplasticity, improved cerebral blood flow, and endorphin release. Socially and behaviorally, obese individuals may face barriers to sustained, high-intensity physical activity due to joint pain, cardiovascular limitations, or stigma-related avoidance of exercise settings, thereby limiting the dose and quality of activity needed to achieve mental health benefits. Additionally, the psychological burden of obesity itself (e.g., body dissatisfaction, perceived discrimination) may overshadow or neutralize the mood-enhancing effects of physical activity.

5.3.2.1 Interaction Plot (Hosmer, Lemeshow, and Sturdivant 2013)

Code
library(interactions)

# Create interaction plot using interact_plot
interact_plot(
  model_removed,
  pred = phys_activity,
  modx = obesity,
  colors = c("#2E86AB", "#A23B72"),
  x.label = "Physical Activity Score",
  y.label = "Predicted Probability of Depression",
  legend.main = "Obesity Status",
  main.title = "Interaction Between Physical Activity and Obesity Status",
  modx.labels = c("Not Obese", "Obese")
) 
Figure 14: Interaction Between Physical Activity and Obesity Status on Depression Status

Interpretation of Interaction Plot: Among non-obese individuals, the predicted probability of depression exhibits a pronounced negative gradient, declining steeply from approximately 90% at the lowest physical activity levels (score ≈ 0) to near 0% at the highest activity levels (score ≈ 10). This drastic reduction across the activity continuum exemplifies the potent dose-response relationship whereby incremental increases in physical activity yield substantial mental health benefits in metabolically healthy individuals.

In contrast, the obese group displays a relatively flat trajectory across the entire physical activity spectrum, with predicted depression probability remaining stable between 50-55% regardless of activity level. The near-horizontal slope confirms the statistical finding that physical activity confers negligible protective benefit against depression in the presence of obesity (OR ≈ 0.98 per unit). Notably, at low physical activity levels (scores < 2), both groups exhibit elevated depression risk, suggesting that sedentary behavior represents a universal risk factor. However, the lines diverge markedly as activity increases beyond this threshold.

The crossover pattern reveals a critical inflection point at approximately physical activity score = 3, where the non-obese trajectory intersects the obese trajectory. Beyond this point, the absolute risk differential widens progressively. This expanding divergence quantifies the magnitude of effect modification and highlight the clinical imperative of addressing obesity as a prerequisite for realizing the full mental health benefits of physical activity interventions.

5.3.3 Estimated Odds of Depression as a Function of Physical Activity Score by Obesity Status

Code
library(gt)
library(tidyverse)

# Create prediction data for all PA levels (0-10) and both obesity groups
pred_data <- expand.grid(
  phys_activity = 0:10,
  obesity = factor(c("Not Obese", "Obese"), levels = c("Not Obese", "Obese")),
  years_working = mean(data.nooutliers$years_working)
)

# Get predicted probabilities
pred_data$prob <- predict(model_removed, newdata = pred_data, type = "response")

# Calculate odds
pred_data <- pred_data %>%
  mutate(odds = prob / (1 - prob))

# Reshape to wide format
odds_table <- pred_data %>%
  select(phys_activity, obesity, odds) %>%
  pivot_wider(names_from = obesity, values_from = odds) %>%
  rename(
    `Physical Activity Score` = phys_activity,
    `Not Obese` = `Not Obese`,
    `Obese` = Obese
  )

# Create gt table
odds_table %>%
  gt() %>%
  fmt_number(columns = c(`Not Obese`, Obese), decimals = 2) %>%
  tab_spanner(
    label = "Odds for Depression",
    columns = c(`Not Obese`, Obese)
  ) %>%
  tab_footnote(
    footnote = "Odds calculated holding years working constant at the mean"
  ) %>%
  tab_style(
    style = cell_text(align = "center"),
    locations = cells_body(columns = c(`Not Obese`, Obese))
  ) %>%
  tab_style(
    style = cell_text(align = "center"),
    locations = cells_column_labels(columns = c(`Not Obese`, Obese))
  ) %>%
  tab_style(
    style = cell_text(align = "center"),
    locations = cells_column_spanners()
  ) %>%
    tab_options(
    table.width = pct(100)
  ) %>%
  opt_align_table_header(align = "left")
Table 19: Estimated Odds of Depression as a Function of Physical Activity Score by Obesity Status
Physical Activity Score
Odds for Depression
Not Obese Obese
0 9.29 1.14
1 4.77 1.12
2 2.45 1.10
3 1.26 1.08
4 0.65 1.06
5 0.33 1.05
6 0.17 1.03
7 0.09 1.01
8 0.05 0.99
9 0.02 0.97
10 0.01 0.96
Odds calculated holding years working constant at the mean

Interpretation: Read alongside the interaction plot, this table puts concrete numbers on the effect modification. Holding years working at its mean, the non-obese odds drop from about 9.29 at activity = 0 to 0.0119 at activity = 10 (roughly a 780-fold reduction), showing a steep dose-response pattern. In contrast, the obese odds shift only slightly across the same range (about 1.14 to 0.96; ~20% change), which confirms the near-flat slope seen in the plot. The table also reveals the crossover: at activity = 0, non-obese odds are ~8× higher than obese, but by activity = 10 the pattern reverses, with obese odds about 80× higher than non-obese. This numeric contrast enriches the plot by highlighting where risk rankings flip and how rapidly the gap widens with increasing activity.

5.4 Prediction

Model-based predictions translate regression coefficients into concrete risk estimates across clinically relevant covariate profiles. By systematically varying years working, physical activity, and obesity status, we generate predicted probabilities that quantify absolute depression risk for diverse scenarios, facilitating risk stratification and supporting evidence-based clinical decision-making.

5.4.1 Create new data frame for prediction

Code
library(DT)
new_data <- expand.grid(
  years_working = seq(0, 40, by = 10),
  phys_activity = seq(0, 10, by = 1),
  obesity = factor(c("Not Obese", "Obese"), levels = c("Not Obese", "Obese"))
)
new_data |> datatable()
Table 20: Prediction Data Frame for Depression Probability Estimation

Interpretation: This grid spans realistic combinations of years working and physical activity for both obesity groups, enabling scenario-based risk comparisons rather than single-point estimates.

5.4.2 Generate predicted probabilities

Code
pred.data <- augment(model_removed, newdata = new_data, type.predict = "response")
pred.data |> datatable()
Table 21: Predicted Probabilities of Depression

Interpretation: The predicted probabilities provide a structured lookup for risk profiling. They show that increasing years working consistently elevates risk, while the protective gradient of physical activity is most pronounced in the non-obese group.

5.4.3 Prediction Heatmap

Code
library(ggplot2)

# Create heatmap
ggplot(pred.data, aes(x = phys_activity, y = years_working, fill = .fitted)) +
  geom_tile() +
  facet_wrap(~obesity, ncol = 2) +
  scale_fill_gradient2(
    low = "#2E86AB",
    mid = "#F4A261", 
    high = "#A23B72",
    midpoint = 0.5,
    limits = c(0, 1),
    name = "Predicted\nProbability"
  ) +
  scale_x_continuous(breaks = seq(0, 10, 2)) +
  scale_y_continuous(breaks = seq(0, 40, 10)) +
  labs(
    title = "Predicted Probability of Depression Across Risk Factor Combinations",
    x = "Physical Activity Score",
    y = "Years Working",
    subtitle = "Darker colors indicate higher depression risk"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 10, color = "gray30"),
    legend.position = "right",
    strip.text = element_text(face = "bold", size = 12),
    panel.grid = element_blank()
  )
Figure 15: Predicted Probability of Depression by Physical Activity and Years Working

Interpretation: The heatmap visualizes the predicted probability of depression across all combinations of physical activity (0-10) and years working (0-40), stratified by obesity status.

For Not Obese individuals, there is a clear gradient pattern: depression risk decreases dramatically as physical activity increases, while it increases modestly with more years working. The top-left corner shows the highest risk, while the bottom-right corner shows the lowest risk.

For Obese individuals, the heatmap shows much less variation with physical activity—the colors remain relatively consistent across different activity levels, confirming that physical activity provides minimal protective benefit. The primary variation is along the years working axis, showing that occupational stress accumulation affects obese individuals regardless of their activity level.

5.5 Practical Significance and Clinical Implications

5.5.1 Public Health Relevance

Magnitude of Risk Factors: The observed associations translate into clinically meaningful differences in depression risk. Using the final model, we can estimate depression probability for prototypical profiles:

  • Low-risk profile (5 years working, high physical activity [score = 8], not obese):
    • Predicted probability = 1.5%
  • High-risk profile (30 years working, low physical activity [score = 2], not obese):
    • Predicted probability = 84.9%

This ~55-fold difference in depression risk highlights the substantial public health burden associated with prolonged occupational stress and sedentary lifestyles.

Effect Modification by Obesity: The attenuated benefit of physical activity among obese individuals represents a critical finding for intervention design. While non-obese workers can substantially reduce depression risk through increased physical activity (OR = 0.51 per unit), obese workers experience minimal mental health benefits from the same behavioral change (OR = 0.98 per unit). This suggests that obesity management must precede or accompany physical activity interventions to maximize mental health outcomes.

5.5.2 Clinical and Policy Recommendations

Targeted Screening: Our model identifies high-risk subgroups warranting enhanced depression screening: workers with >20 years of occupational exposure, individuals with low physical activity scores (<3), and particularly non-obese individuals in this category who would benefit most from activity-based interventions.

Differentiated Interventions: 1. For non-obese individuals: Physical activity promotion represents a highly effective depression prevention strategy, with each additional activity unit providing meaningful risk reduction 2. For obese individuals: Dual-focus interventions addressing both weight management and mental health are necessary, as physical activity alone provides negligible mental health protection in this group

Occupational Health Programs: The consistent effect of years working (OR = 1.08 per year) across all subgroups emphasizes the need for workplace-based mental health programs that address cumulative occupational stressors through job redesign, workload management, and organizational support systems.

5.5.3 Limitations and Contextual Considerations

1. Cross-Sectional Nature: While our model identifies strong statistical associations (Nagelkerke R² = 0.416, AUC = 82.3%), causality cannot be definitively established. Reverse causation (depression leading to reduced physical activity or weight gain) and unmeasured confounding (socioeconomic factors, genetic predisposition) represent alternative explanations.

2. Generalizability: The synthetic dataset, while methodologically rigorous, simulates a specific population structure. Application to other occupational settings or demographic groups requires validation.

3. Interaction Complexity: The obesity-physical activity interaction, while statistically significant, may reflect multiple underlying mechanisms (physiological, behavioral, or measurement-related) that warrant further investigation in longitudinal studies.

6 Conclusion

This comprehensive logistic regression analysis elucidates the complex interplay between years working, physical activity, and obesity status in shaping depression risk. The identification of significant main effects and a robust interaction effect provides nuanced insights into how these factors jointly influence mental health outcomes. The final model demonstrates strong predictive performance and clinical relevance, highlighting key risk factors amenable to intervention. The findings underscore the critical importance of tailored public health strategies that consider individual risk profiles, particularly the modifying role of obesity in attenuating the mental health benefits of physical activity. Future research should aim to validate these findings in diverse populations and explore the underlying mechanisms driving these associations.

7 Github Repository

GitHub Repository

The complete code and data for this analysis are available at:

https://github.com/drfittri/regressionanalysis_gdt500

8 References

Ambler, Gareth, and Axel Benner. 2025. “Mfp: Multivariable Fractional Polynomials.” https://doi.org/10.32614/CRAN.package.mfp.
Durbin, J., and G. S. Watson. 1950. “Testing for Serial Correlation in Least Squares Regression: i.” Biometrika 37 (3/4): 409. https://doi.org/10.2307/2332391.
Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. Third edition. Los Angeles London New Delhi Singapore Washington, DC Melbourne: SAGE.
Harrell , Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer Series in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-19425-7.
Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant. 2013. Applied Logistic Regression. 3. Aufl. Wiley Series in Probability and Statistics. Hoboken, N.J: Wiley.
Kamarul Imran, Wan Nor Arifin, and Tengku Muhammad Hanis Tengku Mokhtar. 2024. Data Analysis in Medicine and Health Using r. https://bookdown.org/drki_musa/dataanalysis/binary-logistic-regression.html#models-comparison.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Philip Waggoner, and Dominique Makowski. 2021. Performance: An r Package for Assessment, Comparison and Testing of Statistical Models” 6: 3139. https://doi.org/10.21105/joss.03139.
Shrestha, Noora. 2019. “Application of Binary Logistic Regression Model to Assess the Likelihood of Overweight.” American Journal of Theoretical and Applied Statistics 8 (1): 18–25. https://doi.org/10.11648/j.ajtas.20190801.13.